Skip to content

Remove deadlock when server is not acking our data #6107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 19, 2019

Conversation

sislakd
Copy link
Contributor

@sislakd sislakd commented May 17, 2019

If a server where WiFiClient or WiFiClientSecure is connected stop acking our outgoing tcp packets (network issue, server issue), write attempt can stuck forever (if nothing else is generating esp_schedule).

The PR propose to change esp_yield in ClientContext::_write_from_source to delay which can be interrupted by esp_schedule or after given timeout expires. In the next cycle _is_timeout() is matched and write attempt ends.

For WiFiClientSecure it took 2x 15 seconds to return from WiFiClientSecure::_write method if the remote peer is not acking out outgoing tcp packets and send buffer is already full. There is _run_until called from flush and the second _run_until called in the loop after which it get out of _write method because of -1. It seems that it would be good to have default timeout consistent with WiFiClient. Thus, PR proposes to reduce timeout from 15 seconds to 5 seconds in WiFiClientSecure after we have successful handshake with the remote peer.

@d-a-v d-a-v requested a review from earlephilhower May 17, 2019 13:10
Copy link
Collaborator

@earlephilhower earlephilhower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start, but a few problems I think I see:
. We need to reset _timeout in the ::connect/::connectServer methods since the object can be reused.
. The delay() change worries me (BW on small and large packets) and does not seem to be necessary.

Thx
-EFP3

@earlephilhower earlephilhower added this to the 2.5.2 milestone May 17, 2019
@earlephilhower
Copy link
Collaborator

@sislakd, I've tested it and can say that delay(xxx) is not interrupted on a packet and delay() needs to be undone.

#include <ESP8266WiFi.h>
WiFiServer server(80);
void setup() {
Serial.begin(115200);
WiFi.begin(SSID,PASS);
  while (WiFi.status() != WL_CONNECTED) delay(100);
Serial.print("\n\nconnected, ip =");
Serial.println(WiFi.localIP());
  pinMode(LED_BUILTIN, OUTPUT);     // Initialize the LED_BUILTIN pin as an output
  server.begin();
}

// the loop function runs over and over again forever
void loop() {
  digitalWrite(LED_BUILTIN, LOW);   // Turn the LED on (Note that LOW is the voltage level
  // but actually the LED is on; this is because
  // it is active low on the ESP-01)
  delay(10000);                      // Wait for a second
  digitalWrite(LED_BUILTIN, HIGH);  // Turn the LED off by making the voltage HIGH
  delay(10000);                      // Wait for two seconds (to demonstrate the active low LED)
}

Run the script and you'll see a 10s long high and low period. While it's going, do a NC/telnet/etc. to the 8266 port 80. If delay() was interrupted on a packet, the 10S period would be shortened when the connection came in. It is not, it still is 10s until the LED flips.

So, please undo the delay(timeout) change.

@sislakd
Copy link
Contributor Author

sislakd commented May 17, 2019

@earlephilhower Have you modified WiFiServer::_accept method to include esp_schedule() in your test? delay call is interrupted only if something call esp_schedule(). The current version of WiFiServer doesn't include esp_schedule(). Thus, you cannot see that behavior.

@sislakd
Copy link
Contributor Author

sislakd commented May 17, 2019

Revert of delay in _write_from_source method to esp_yield doesn't solve problem of deadlock when there is no ack coming. What will work is just yield() or optimistic_yield causing quick cycling in _write_from_source method allowing timeout to return from the method if the remote peer is not responding.

@earlephilhower
Copy link
Collaborator

@sislakd, I think maybe I wasn't clear. I was demonstrating that when you put in a delay(timeoutms) in the path, that delay will not return before the timeout, even if another packet or ack or whatever comes in. This is against the current head 968d6fc.

That's the current behavior and people do depend on it. delay(100) is around 100ms, and not sometimes 1ms. SWSerial, others, etc.

I'm not sure what you meant about changing WiFiClient::accept, but if you're talking about something which will globally cause delay() to now finish on an event on the TCP stack (and not time >delay) that's breaking things badly for the rest of the ecosystem.

So, what I'm saying is that delay(timeout) can only worsen TCP response since it will go idle no matter what even if an ack or whatever comes in, and that changing delay()'s semantics so that it sometimes returns as a random variable (since the TCP stack is async and anyone can send at any time) isn't going to work.

@sislakd
Copy link
Contributor Author

sislakd commented May 18, 2019

It seems that you don't understand what I'm trying to explain. I will try to summarize my thinking in details and hopefully it will be much clearer than before.

  • The key statement is that delay(ms) function called from application context returns earlier than defined ms interval if and only if function esp_schedule() is called from core context while application context is within delay function during that time.
  • The application context is code executed in setup() and loop() functions defined by application. On the other hand core context is everything else (code executed by core, interrupt handlers, system timers, ...).
  • Here is excerpt from core_esp8266_wiring.cpp and core_esp8266_main.cpp related to delay, esp_yield and esp_schedule functions:
void delay_end(void* arg) {
    (void) arg;
    esp_schedule();
}

void delay(unsigned long ms) {
    if(ms) {
        os_timer_setfn(&delay_timer, (os_timer_func_t*) &delay_end, 0);
        os_timer_arm(&delay_timer, ms, ONCE);
    } else {
        esp_schedule();
    }
    esp_yield();
    if(ms) {
        os_timer_disarm(&delay_timer);
    }
}

extern "C" void esp_yield() {
    if (cont_can_yield(g_pcont)) {
        cont_yield(g_pcont);
    }
}

extern "C" void esp_schedule() {
    ets_post(LOOP_TASK_PRIORITY, 0, 0);
}
  • The function delay(ms) called with non-zero ms internally configures os_timer (once expired configured delay_end function it is called in core context) and call esp_yield(). Within esp_yield() function it is checked that the function is called from application context (using cont_can_yield) and call cont_yield(...) functions. If you check definition of cont_yield(...) in cont.S, it implements context switch by storing register values, prepare return path back to application context, change stack pointer back to core context, and jump to cont_ret position in cont.S where registers from core context are reloaded and hit ret instruction returning back to core context code execution (the next line after cont_run(...) call within loop_task(...) function). Thus, during call of esp_yield() within delay(ms) function, application context code execution is blocked.
  • The way how the execution context code execution is resumed again is given by the call of esp_schedule() function. During this function there is scheduled the next execution of loop_task(...) from core context when possible. Once (some time later), the core context executes function loop_task(...) it calls cont_run(...) function which implements context switch back to application context. The context switch is similar to the one described in the previous bullet but in other direction (from core to application).
  • The timed delay is implemented using os_timer where the function delay_end(...) is called after timer expires. The function delay_end just call esp_schedule() function described in the previous bullet.
  • At this stage it should be clear that delay(ms) can return due to either expired os_timer (which calls internally esp_schedule()) or call of esp_schedule() from different place. Thus, if there is no esp_schedule() function call anywhere else delay(ms) implements timed sleep returning close to defined time interval. However, if you search Arduino code for usage of esp_schedule() function there are many places and some of them are called while code is executed in core context (these are those which can truncate delay and ends it before given interval elapses).
  • Now, it is crucial to know when such situation happens. Just existence of possible esp_schedule() calls doesn't mean that they are always executed. There are many conditions when this can happen.

Now back to your simple test application:

  • Within your loop() function you call delay(ms) and turn ON and OFF LED to see premature return from delay(ms). That's fine.
  • You expect to have delay(ms) call be interrupted by WiFiServer upon new connection attempt made to the listening tcp.
  • I'm saying that in your test application there is nothing (really nothing) calling esp_schedule() in the core context except delay_end(...) called from os_timer. Thus, running this application is not affected by whatever number of connection attempts to WiFiServer in your application.
  • If you would like to really see the interrupted return from delay(ms) in your test application, you have to modify the core code to make esp_schedule() call from core context upon external event (e.g. your connection attempt to WiFiServer). The simplest way to achieve this is to add call of esp_schedule() function in the beginning of WiFiServer::_accept method. This method is called from lwIP as registered connection accept event handler. I'm not going to describe internals of lwIP to describe that this event handler is called from tcp_input (which is called from WiFi frame processing via ip_input processing). The key is that WiFiServer::_accept method is called from core context and thus execution of your delay(ms) will be interrupted upon your connection attempt to WiFiServer then.

Now back to this PR:

  • I'm proposing to use delay(ms) function (instead of esp_yield()) within loop in ClientContext::_write_from_source(...) method.
  • You think that it will worsen TCP response time because incoming ack will not interrupt delay(ms) until given timeout elapses.
  • Thats not true. If you don't see that from my long explanation above, you can take it from different perspective. The current esp_yield() in ClientContext::_write_from_source(...) method is interrupted by call esp_schedule in ClientContext::_acked(...) method. And under good connection and other peer condition this works. Thus, it must work also using delay(ms) function instead of esp_yield(). As described above delay(ms) uses esp_yield() anyway.

Why this PR is proposed then:

  • If network or remote peer is not working well and our going tcp packets are not acked, the tcp send queue gets filled until there is no more space to accommodate data from ClientContext and ClientContext:: _write_from_source(...) method sets _send_waiting to true and call esp_yield() function. This blocks application context execution. With application not using any other libraries calling esp_yield() from the core context, there is no place where esp_yield() is resuming blocked execution of application context. Thus, the code is stuck in esp_yield() forever in ClientContext:: _write_from_source(...). There are no more loops and thus it is not possible to timeout at all.
  • Even if you have multiple WiFiClient and/or WiFiClientSecure objects, no other can call esp_yield(). There is only one application context which is stuck in esp_yield() in one ClientContext instance. Only this one instance has field _send_waiting set to true. All other have it false. Thus, even there is ClientContext::_acked(...) called in different ClientContext instance (because there is ACK on that connection), no esp_schedule() is called (given by the condition in ClientContext::_write_some_from_cb()) as there is not fulfilled condition having _send_waiting==true.
  • Thus, proposed usage of delay(ms) instead of esp_yield() in ClientContext:: _write_from_source(...) method will unblock either upon freeing some space in tcp send buffer (because of esp_schedule() in ClientContext::_acked(...)) or after elapsing given timeout.
  • Timed based end of delay(ms) is crucial to allow loop continue, detect timeout and return back to WiFiClient or WiFiClientSecure to handle further this case.

Let me know whether this explanation makes sense to you and you understand why this PR is important to fix ClientContext implementation.

Sorry for the long comment, but I don't know how much details from other code you know in details.

@earlephilhower
Copy link
Collaborator

@sislakd, thanks for your detailed explanation. The LWIP and packet interface is one place I've thankfully not had to dive into, and I think I understand where you're going here now.

I was talking w /@devyte about this as well and I can see that the pattern of "delay(timeout)" with a "esp_schedule()" to short-circuit it is used in the ::connect method. Your last bit is clear and seems fine now that I get it.

Thus, proposed usage of delay(ms) instead of esp_yield() in ClientContext:: _write_from_source(...) method will unblock either upon freeing some space in tcp send buffer (because of esp_schedule() in ClientContext::_acked(...)) or after elapsing given timeout.

I appreciate your recent PRs, but I do hope you understand that I'm only trying to be extremely careful with such a critical bit of the code and not trying to discourage you!

@earlephilhower
Copy link
Collaborator

To @d-a-v for final signoff.

@d-a-v
Copy link
Collaborator

d-a-v commented May 18, 2019

@sislakd

Let me know whether this explanation makes sense to you and you understand why this PR is important to fix ClientContext implementation.

Sorry for the long comment, but I don't know how much details from other code you know in details.

After @earlephilhower 's question, your explanation is crystal clear.
I would have liked some comments around the delay() change stating something like "will be interrupted by esp_schedule on next received ack".

You are mastering the nonos model of this core, also network management internals.
Thanks for this second PR.

As a side note, I'm a bit concerned about how _timeout is managed. It is defined by Stream:: (1000 by default), possibly modified by user and overwritten during SSL handshake (to 15000 with good reasons) and now per this PR restored to 5000. I'll leave @earlephilhower think of it since this is very specific to BearSSL client.

@earlephilhower
Copy link
Collaborator

earlephilhower commented May 18, 2019

_timeout is much longer because SSL handshake is very slow. AXTls started it and BearSSL just took it over.

After handshake, things go much faster. AES isn't nearly as expensive.

@sislakd
Copy link
Contributor Author

sislakd commented May 19, 2019

I've added comment to introduced delay to be understandable while reading the code.

@d-a-v d-a-v merged commit 912c0db into esp8266:master May 19, 2019
earlephilhower added a commit to earlephilhower/Arduino that referenced this pull request May 20, 2019
Changes since 2.5.1 (to 2.5.2)

Core
----
* Add explicit Print::write(char) (esp8266#6101)

Build system
----
* Fix typo in elf2bin for QOUT binary generation (esp8266#6116)
* Support PIO Wl-T and Arduino -T linking properly (esp8266#6095)
* Allow *.cc files to be linked into flash by default (esp8266#6100)
* Use custom "ElfToBin" builder for PIO (esp8266#6091)
* Fail if generated JSON file cannot be read (esp8266#6076)
* Moved 'Dropping' print from stdout to stderr in drop_versions.py (esp8266#6071)
* Fix PIO issue when build environment contains spaces (esp8266#6119)

Libraries
----
* Remove deadlock when server is not acking our data (esp8266#6107)
* Bugfix for stuck in write method of WiFiClient and WiFiClientSecure until the remote peer closed connection (esp8266#6104)
* Re-add original SD FAT info access methods (esp8266#6092)
* Make FILE_WRITE append in SD.h wrapper (esp8266#6106)
* Drop X509 after connection, avoid hang on TLS broken (esp8266#6065)
@earlephilhower earlephilhower mentioned this pull request May 20, 2019
earlephilhower added a commit that referenced this pull request May 20, 2019
Changes since 2.5.1 (to 2.5.2)

Core
----
* Add explicit Print::write(char) (#6101)

Build system
----
* Fix typo in elf2bin for QOUT binary generation (#6116)
* Support PIO Wl-T and Arduino -T linking properly (#6095)
* Allow *.cc files to be linked into flash by default (#6100)
* Use custom "ElfToBin" builder for PIO (#6091)
* Fail if generated JSON file cannot be read (#6076)
* Moved 'Dropping' print from stdout to stderr in drop_versions.py (#6071)
* Fix PIO issue when build environment contains spaces (#6119)

Libraries
----
* Remove deadlock when server is not acking our data (#6107)
* Bugfix for stuck in write method of WiFiClient and WiFiClientSecure until the remote peer closed connection (#6104)
* Re-add original SD FAT info access methods (#6092)
* Make FILE_WRITE append in SD.h wrapper (#6106)
* Drop X509 after connection, avoid hang on TLS broken (#6065)
@d-a-v d-a-v mentioned this pull request Jun 20, 2019
mcspr pushed a commit that referenced this pull request Oct 16, 2021
esp_yield() now also calls esp_schedule(), original esp_yield() function renamed to esp_suspend().

Don't use delay(0) in the Core internals, libraries and examples. Use yield() when the code is
supposed to be called from CONT, use esp_yield() when the code can be called from either CONT or SYS.
Clean-up esp_yield() and esp_schedule() declarations across the code and use coredecls.h instead.

Implement helper functions for libraries that were previously using esp_yield(), esp_schedule() and
esp_delay() directly to wait for certain SYS context tasks to complete. Correctly use esp_delay()
for timeouts, make sure scheduled functions have a chance to run (e.g. LwIP_Ethernet uses recurrent)

Related issues:
- #6107 - discussion about the esp_yield() and esp_delay() usage in ClientContext
- #6212 - discussion about replacing delay() with a blocking loop
- #6680 - pull request introducing LwIP-based Ethernet
- #7146 - discussion that originated UART code changes
- #7969 - proposal to remove delay(0) from the example code
- #8291 - discussion related to the run_scheduled_recurrent_functions() usage in LwIP Ethernet
- #8317 - yieldUntil() implementation, similar to the esp_delay() overload with a timeout and a 0 interval
hasenradball pushed a commit to hasenradball/Arduino that referenced this pull request Nov 18, 2024
esp_yield() now also calls esp_schedule(), original esp_yield() function renamed to esp_suspend().

Don't use delay(0) in the Core internals, libraries and examples. Use yield() when the code is
supposed to be called from CONT, use esp_yield() when the code can be called from either CONT or SYS.
Clean-up esp_yield() and esp_schedule() declarations across the code and use coredecls.h instead.

Implement helper functions for libraries that were previously using esp_yield(), esp_schedule() and
esp_delay() directly to wait for certain SYS context tasks to complete. Correctly use esp_delay()
for timeouts, make sure scheduled functions have a chance to run (e.g. LwIP_Ethernet uses recurrent)

Related issues:
- esp8266#6107 - discussion about the esp_yield() and esp_delay() usage in ClientContext
- esp8266#6212 - discussion about replacing delay() with a blocking loop
- esp8266#6680 - pull request introducing LwIP-based Ethernet
- esp8266#7146 - discussion that originated UART code changes
- esp8266#7969 - proposal to remove delay(0) from the example code
- esp8266#8291 - discussion related to the run_scheduled_recurrent_functions() usage in LwIP Ethernet
- esp8266#8317 - yieldUntil() implementation, similar to the esp_delay() overload with a timeout and a 0 interval
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants